The purpose of this project is to use regression to predict the number of shares of articles published by Mashable.

Dataset source: http://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

Abstract: This dataset summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity).

Data Set Characteristics: Multivariate

Number of Instances: 39797

Area: Business

Attribute Characteristics: Integer, Real

Number of Attributes: 61

Date Donated: 2015-05-31

Attribute Information: 61 attributes in total (58 predictive attributes, 2 non-predictive, 1 goal field).

  0. url: URL of the article (non-predictive; not included in the analysis)

  1. timedelta: Days between the article publication and the dataset acquisition (non-predictive; not included in the analysis)

  2. n_tokens_title: Number of words in the title

  3. n_tokens_content: Number of words in the content

  4. n_unique_tokens: Rate of unique words in the content

  5. n_non_stop_words: Rate of non-stop words in the content. Stop words include determiners, which mark nouns and are usually followed by one (e.g. the, a, an, another); coordinating conjunctions, which connect words, phrases, and clauses (e.g. for, and, nor, but, or, yet, so); and prepositions, which express temporal or spatial relations (e.g. in, under, towards, before)

  6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content

  7. num_hrefs: Number of links

  8. num_self_hrefs: Number of links to other articles published by Mashable

  9. num_imgs: Number of images

  10. num_videos: Number of videos

  11. average_token_length: Average length of the words in the content

  12. num_keywords: Number of keywords in the metadata

  13. data_channel_is_lifestyle: Is data channel ‘Lifestyle’?
  14. data_channel_is_entertainment: Is data channel ‘Entertainment’?
  15. data_channel_is_bus: Is data channel ‘Business’?
  16. data_channel_is_socmed: Is data channel ‘Social Media’?
  17. data_channel_is_tech: Is data channel ‘Tech’?
  18. data_channel_is_world: Is data channel ‘World’?

  19. kw_min_min: Worst keyword (min. shares)
  20. kw_max_min: Worst keyword (max. shares)
  21. kw_avg_min: Worst keyword (avg. shares)
  22. kw_min_max: Best keyword (min. shares)
  23. kw_max_max: Best keyword (max. shares)
  24. kw_avg_max: Best keyword (avg. shares)
  25. kw_min_avg: Avg. keyword (min. shares)
  26. kw_max_avg: Avg. keyword (max. shares)
  27. kw_avg_avg: Avg. keyword (avg. shares)

  28. self_reference_min_shares: Min. shares of referenced articles in Mashable
  29. self_reference_max_shares: Max. shares of referenced articles in Mashable
  30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable

  31. weekday_is_monday: Was the article published on a Monday?
  32. weekday_is_tuesday: Was the article published on a Tuesday?
  33. weekday_is_wednesday: Was the article published on a Wednesday?
  34. weekday_is_thursday: Was the article published on a Thursday?
  35. weekday_is_friday: Was the article published on a Friday?
  36. weekday_is_saturday: Was the article published on a Saturday?
  37. weekday_is_sunday: Was the article published on a Sunday?
  38. is_weekend: Was the article published on the weekend?

  39. LDA_00: Closeness to LDA topic 0
  40. LDA_01: Closeness to LDA topic 1
  41. LDA_02: Closeness to LDA topic 2
  42. LDA_03: Closeness to LDA topic 3
  43. LDA_04: Closeness to LDA topic 4

  44. global_subjectivity: Text subjectivity
  45. global_sentiment_polarity: Text sentiment polarity
  46. global_rate_positive_words: Rate of positive words in the content
  47. global_rate_negative_words: Rate of negative words in the content
  48. rate_positive_words: Rate of positive words among non-neutral tokens
  49. rate_negative_words: Rate of negative words among non-neutral tokens

  50. avg_positive_polarity: Avg. polarity of positive words
  51. min_positive_polarity: Min. polarity of positive words
  52. max_positive_polarity: Max. polarity of positive words
  53. avg_negative_polarity: Avg. polarity of negative words
  54. min_negative_polarity: Min. polarity of negative words
  55. max_negative_polarity: Max. polarity of negative words

  56. title_subjectivity: Title subjectivity
  57. title_sentiment_polarity: Title polarity
  58. abs_title_subjectivity: Absolute subjectivity level
  59. abs_title_sentiment_polarity: Absolute polarity level

  60. shares: Number of shares (target)

We have based parts of our analysis on the paper “Predicting and Evaluating the Popularity of Online News”, available here: http://cs229.stanford.edu/proj2015/328_report.pdf

Start by reading in the data and taking a look at its structure:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
data = read.csv("OnlineNewsPopularity.csv")
str(data)
## 'data.frame':    39644 obs. of  61 variables:
##  $ url                          : Factor w/ 39644 levels "http://mashable.com/2013/01/07/amazon-instant-video-browser/",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ timedelta                    : num  731 731 731 731 731 731 731 731 731 731 ...
##  $ n_tokens_title               : num  12 9 9 9 13 10 8 12 11 10 ...
##  $ n_tokens_content             : num  219 255 211 531 1072 ...
##  $ n_unique_tokens              : num  0.664 0.605 0.575 0.504 0.416 ...
##  $ n_non_stop_words             : num  1 1 1 1 1 ...
##  $ n_non_stop_unique_tokens     : num  0.815 0.792 0.664 0.666 0.541 ...
##  $ num_hrefs                    : num  4 3 3 9 19 2 21 20 2 4 ...
##  $ num_self_hrefs               : num  2 1 1 0 19 2 20 20 0 1 ...
##  $ num_imgs                     : num  1 1 1 1 20 0 20 20 0 1 ...
##  $ num_videos                   : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ average_token_length         : num  4.68 4.91 4.39 4.4 4.68 ...
##  $ num_keywords                 : num  5 4 6 7 7 9 10 9 7 5 ...
##  $ data_channel_is_lifestyle    : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ data_channel_is_entertainment: num  1 0 0 1 0 0 0 0 0 0 ...
##  $ data_channel_is_bus          : num  0 1 1 0 0 0 0 0 0 0 ...
##  $ data_channel_is_socmed       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ data_channel_is_tech         : num  0 0 0 0 1 1 0 1 1 0 ...
##  $ data_channel_is_world        : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ kw_min_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_min_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_min_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ self_reference_min_shares    : num  496 0 918 0 545 8500 545 545 0 0 ...
##  $ self_reference_max_shares    : num  496 0 918 0 16000 8500 16000 16000 0 0 ...
##  $ self_reference_avg_sharess   : num  496 0 918 0 3151 ...
##  $ weekday_is_monday            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_tuesday           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_wednesday         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_thursday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_friday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_saturday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_sunday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_weekend                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LDA_00                       : num  0.5003 0.7998 0.2178 0.0286 0.0286 ...
##  $ LDA_01                       : num  0.3783 0.05 0.0333 0.4193 0.0288 ...
##  $ LDA_02                       : num  0.04 0.0501 0.0334 0.4947 0.0286 ...
##  $ LDA_03                       : num  0.0413 0.0501 0.0333 0.0289 0.0286 ...
##  $ LDA_04                       : num  0.0401 0.05 0.6822 0.0286 0.8854 ...
##  $ global_subjectivity          : num  0.522 0.341 0.702 0.43 0.514 ...
##  $ global_sentiment_polarity    : num  0.0926 0.1489 0.3233 0.1007 0.281 ...
##  $ global_rate_positive_words   : num  0.0457 0.0431 0.0569 0.0414 0.0746 ...
##  $ global_rate_negative_words   : num  0.0137 0.01569 0.00948 0.02072 0.01213 ...
##  $ rate_positive_words          : num  0.769 0.733 0.857 0.667 0.86 ...
##  $ rate_negative_words          : num  0.231 0.267 0.143 0.333 0.14 ...
##  $ avg_positive_polarity        : num  0.379 0.287 0.496 0.386 0.411 ...
##  $ min_positive_polarity        : num  0.1 0.0333 0.1 0.1364 0.0333 ...
##  $ max_positive_polarity        : num  0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
##  $ avg_negative_polarity        : num  -0.35 -0.119 -0.467 -0.37 -0.22 ...
##  $ min_negative_polarity        : num  -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
##  $ max_negative_polarity        : num  -0.2 -0.1 -0.133 -0.167 -0.05 ...
##  $ title_subjectivity           : num  0.5 0 0 0 0.455 ...
##  $ title_sentiment_polarity     : num  -0.188 0 0 0 0.136 ...
##  $ abs_title_subjectivity       : num  0 0.5 0.5 0.5 0.0455 ...
##  $ abs_title_sentiment_polarity : num  0.188 0 0 0 0.136 ...
##  $ shares                       : int  593 711 1500 1200 505 855 556 891 3600 710 ...
set.seed(27)
#Create a sample of the data to reduce the time of processing it
dsmall <- sample_n(data, 10000)

Exploratory Analysis of Variables

First, generate a global summary of the dataset to understand how the values are distributed

summary(dsmall)
##                                                              url      
##  http://mashable.com/2013/01/07/amazon-instant-video-browser/  :   1  
##  http://mashable.com/2013/01/07/apple-40-billion-app-downloads/:   1  
##  http://mashable.com/2013/01/07/att-u-verse-apps/              :   1  
##  http://mashable.com/2013/01/07/bodymedia-armbandgets-update/  :   1  
##  http://mashable.com/2013/01/07/cosmic-events-doomsday/        :   1  
##  http://mashable.com/2013/01/07/earth-size-planets-milky-way/  :   1  
##  (Other)                                                       :9994  
##    timedelta     n_tokens_title n_tokens_content n_unique_tokens 
##  Min.   :  8.0   Min.   : 2.0   Min.   :   0.0   Min.   :0.0000  
##  1st Qu.:164.0   1st Qu.: 9.0   1st Qu.: 245.0   1st Qu.:0.4707  
##  Median :338.0   Median :10.0   Median : 408.0   Median :0.5410  
##  Mean   :355.3   Mean   :10.4   Mean   : 541.7   Mean   :0.5320  
##  3rd Qu.:543.0   3rd Qu.:12.0   3rd Qu.: 709.2   3rd Qu.:0.6104  
##  Max.   :731.0   Max.   :19.0   Max.   :7185.0   Max.   :1.0000  
##                                                                  
##  n_non_stop_words n_non_stop_unique_tokens   num_hrefs     
##  Min.   :0.0000   Min.   :0.0000           Min.   :  0.00  
##  1st Qu.:1.0000   1st Qu.:0.6261           1st Qu.:  4.00  
##  Median :1.0000   Median :0.6910           Median :  7.00  
##  Mean   :0.9713   Mean   :0.6741           Mean   : 10.83  
##  3rd Qu.:1.0000   3rd Qu.:0.7557           3rd Qu.: 14.00  
##  Max.   :1.0000   Max.   :1.0000           Max.   :145.00  
##                                                            
##  num_self_hrefs      num_imgs         num_videos     average_token_length
##  Min.   :  0.00   Min.   :  0.000   Min.   : 0.000   Min.   :0.000       
##  1st Qu.:  1.00   1st Qu.:  1.000   1st Qu.: 0.000   1st Qu.:4.479       
##  Median :  3.00   Median :  1.000   Median : 0.000   Median :4.666       
##  Mean   :  3.31   Mean   :  4.494   Mean   : 1.236   Mean   :4.552       
##  3rd Qu.:  4.00   3rd Qu.:  4.000   3rd Qu.: 1.000   3rd Qu.:4.855       
##  Max.   :116.00   Max.   :111.000   Max.   :91.000   Max.   :7.696       
##                                                                          
##   num_keywords    data_channel_is_lifestyle data_channel_is_entertainment
##  Min.   : 1.000   Min.   :0.0000            Min.   :0.0000               
##  1st Qu.: 6.000   1st Qu.:0.0000            1st Qu.:0.0000               
##  Median : 7.000   Median :0.0000            Median :0.0000               
##  Mean   : 7.223   Mean   :0.0547            Mean   :0.1788               
##  3rd Qu.: 9.000   3rd Qu.:0.0000            3rd Qu.:0.0000               
##  Max.   :10.000   Max.   :1.0000            Max.   :1.0000               
##                                                                          
##  data_channel_is_bus data_channel_is_socmed data_channel_is_tech
##  Min.   :0.0000      Min.   :0.0000         Min.   :0.0000      
##  1st Qu.:0.0000      1st Qu.:0.0000         1st Qu.:0.0000      
##  Median :0.0000      Median :0.0000         Median :0.0000      
##  Mean   :0.1543      Mean   :0.0594         Mean   :0.1859      
##  3rd Qu.:0.0000      3rd Qu.:0.0000         3rd Qu.:0.0000      
##  Max.   :1.0000      Max.   :1.0000         Max.   :1.0000      
##                                                                 
##  data_channel_is_world   kw_min_min       kw_max_min       kw_avg_min     
##  Min.   :0.0000        Min.   : -1.00   Min.   :     0   Min.   :   -1.0  
##  1st Qu.:0.0000        1st Qu.: -1.00   1st Qu.:   444   1st Qu.:  138.9  
##  Median :0.0000        Median : -1.00   Median :   653   Median :  233.3  
##  Mean   :0.2137        Mean   : 26.43   Mean   :  1183   Mean   :  320.2  
##  3rd Qu.:0.0000        3rd Qu.:  4.00   3rd Qu.:  1000   3rd Qu.:  352.6  
##  Max.   :1.0000        Max.   :377.00   Max.   :138700   Max.   :34855.1  
##                                                                           
##    kw_min_max       kw_max_max       kw_avg_max       kw_min_avg  
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :  -1  
##  1st Qu.:     0   1st Qu.:843300   1st Qu.:171300   1st Qu.:   0  
##  Median :  1400   Median :843300   Median :245263   Median :1022  
##  Mean   : 14201   Mean   :751491   Mean   :259888   Mean   :1112  
##  3rd Qu.:  7825   3rd Qu.:843300   3rd Qu.:332258   3rd Qu.:2051  
##  Max.   :843300   Max.   :843300   Max.   :843300   Max.   :3607  
##                                                                   
##    kw_max_avg       kw_avg_avg    self_reference_min_shares
##  Min.   :     0   Min.   :    0   Min.   :     0           
##  1st Qu.:  3564   1st Qu.: 2372   1st Qu.:   629           
##  Median :  4332   Median : 2871   Median :  1200           
##  Mean   :  5755   Mean   : 3148   Mean   :  4027           
##  3rd Qu.:  6016   3rd Qu.: 3591   3rd Qu.:  2600           
##  Max.   :237967   Max.   :37608   Max.   :690400           
##                                                            
##  self_reference_max_shares self_reference_avg_sharess weekday_is_monday
##  Min.   :     0            Min.   :     0.0           Min.   :0.0000   
##  1st Qu.:  1100            1st Qu.:   979.6           1st Qu.:0.0000   
##  Median :  2800            Median :  2200.0           Median :0.0000   
##  Mean   : 10771            Mean   :  6618.6           Mean   :0.1692   
##  3rd Qu.:  7925            3rd Qu.:  5200.0           3rd Qu.:0.0000   
##  Max.   :843300            Max.   :690400.0           Max.   :1.0000   
##                                                                        
##  weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
##  Min.   :0.0000     Min.   :0.0000       Min.   :0.0000     
##  1st Qu.:0.0000     1st Qu.:0.0000       1st Qu.:0.0000     
##  Median :0.0000     Median :0.0000       Median :0.0000     
##  Mean   :0.1894     Mean   :0.1885       Mean   :0.1805     
##  3rd Qu.:0.0000     3rd Qu.:0.0000       3rd Qu.:0.0000     
##  Max.   :1.0000     Max.   :1.0000       Max.   :1.0000     
##                                                             
##  weekday_is_friday weekday_is_saturday weekday_is_sunday   is_weekend    
##  Min.   :0.0000    Min.   :0.0000      Min.   :0.0000    Min.   :0.0000  
##  1st Qu.:0.0000    1st Qu.:0.0000      1st Qu.:0.0000    1st Qu.:0.0000  
##  Median :0.0000    Median :0.0000      Median :0.0000    Median :0.0000  
##  Mean   :0.1402    Mean   :0.0624      Mean   :0.0698    Mean   :0.1322  
##  3rd Qu.:0.0000    3rd Qu.:0.0000      3rd Qu.:0.0000    3rd Qu.:0.0000  
##  Max.   :1.0000    Max.   :1.0000      Max.   :1.0000    Max.   :1.0000  
##                                                                          
##      LDA_00            LDA_01            LDA_02            LDA_03       
##  Min.   :0.01818   Min.   :0.01819   Min.   :0.01818   Min.   :0.01818  
##  1st Qu.:0.02505   1st Qu.:0.02501   1st Qu.:0.02857   1st Qu.:0.02593  
##  Median :0.03339   Median :0.03335   Median :0.04001   Median :0.04000  
##  Mean   :0.18271   Mean   :0.14098   Mean   :0.21778   Mean   :0.22391  
##  3rd Qu.:0.23935   3rd Qu.:0.15017   3rd Qu.:0.33654   3rd Qu.:0.36705  
##  Max.   :0.92000   Max.   :0.92595   Max.   :0.92000   Max.   :0.92554  
##                                                                         
##      LDA_04        global_subjectivity global_sentiment_polarity
##  Min.   :0.01818   Min.   :0.0000      Min.   :-0.37766         
##  1st Qu.:0.02857   1st Qu.:0.3942      1st Qu.: 0.05604         
##  Median :0.04282   Median :0.4538      Median : 0.11861         
##  Mean   :0.23462   Mean   :0.4438      Mean   : 0.11873         
##  3rd Qu.:0.39910   3rd Qu.:0.5098      3rd Qu.: 0.17797         
##  Max.   :0.91999   Max.   :1.0000      Max.   : 0.62500         
##                                                                 
##  global_rate_positive_words global_rate_negative_words rate_positive_words
##  Min.   :0.00000            Min.   :0.000000           Min.   :0.0000     
##  1st Qu.:0.02820            1st Qu.:0.009615           1st Qu.:0.6000     
##  Median :0.03891            Median :0.015409           Median :0.7074     
##  Mean   :0.03952            Mean   :0.016647           Mean   :0.6820     
##  3rd Qu.:0.05021            3rd Qu.:0.021739           3rd Qu.:0.8000     
##  Max.   :0.15217            Max.   :0.184932           Max.   :1.0000     
##                                                                           
##  rate_negative_words avg_positive_polarity min_positive_polarity
##  Min.   :0.0000      Min.   :0.0000        Min.   :0.00000      
##  1st Qu.:0.1852      1st Qu.:0.3042        1st Qu.:0.05000      
##  Median :0.2857      Median :0.3583        Median :0.10000      
##  Mean   :0.2891      Mean   :0.3538        Mean   :0.09686      
##  3rd Qu.:0.3846      3rd Qu.:0.4132        3rd Qu.:0.10000      
##  Max.   :1.0000      Max.   :1.0000        Max.   :1.00000      
##                                                                 
##  max_positive_polarity avg_negative_polarity min_negative_polarity
##  Min.   :0.0000        Min.   :-1.0000       Min.   :-1.0000      
##  1st Qu.:0.6000        1st Qu.:-0.3304       1st Qu.:-0.7000      
##  Median :0.8000        Median :-0.2542       Median :-0.5000      
##  Mean   :0.7535        Mean   :-0.2614       Mean   :-0.5225      
##  3rd Qu.:1.0000        3rd Qu.:-0.1883       3rd Qu.:-0.3000      
##  Max.   :1.0000        Max.   : 0.0000       Max.   : 0.0000      
##                                                                   
##  max_negative_polarity title_subjectivity title_sentiment_polarity
##  Min.   :-1.0000       Min.   :0.0000     Min.   :-1.00000        
##  1st Qu.:-0.1250       1st Qu.:0.0000     1st Qu.: 0.00000        
##  Median :-0.1000       Median :0.1250     Median : 0.00000        
##  Mean   :-0.1099       Mean   :0.2782     Mean   : 0.06865        
##  3rd Qu.:-0.0500       3rd Qu.:0.5000     3rd Qu.: 0.13636        
##  Max.   : 0.0000       Max.   :1.0000     Max.   : 1.00000        
##                                                                   
##  abs_title_subjectivity abs_title_sentiment_polarity     shares      
##  Min.   :0.0000         Min.   :0.0000               Min.   :     5  
##  1st Qu.:0.1667         1st Qu.:0.0000               1st Qu.:   945  
##  Median :0.5000         Median :0.0000               Median :  1400  
##  Mean   :0.3439         Mean   :0.1554               Mean   :  3211  
##  3rd Qu.:0.5000         3rd Qu.:0.2500               3rd Qu.:  2700  
##  Max.   :0.5000         Max.   :1.0000               Max.   :652900  
## 

The main finding of this preliminary analysis is the large gap between the 3rd quartile of the number of shares (2,700) and its maximum (652,900). Such extreme values will most likely degrade the quality of the predictions we make.

As a second step of the exploratory analysis, we generated histograms to understand how the different variables are distributed. We skipped the binary variables, since histograms add little to understanding their distribution.

hist(dsmall$n_tokens_title)

hist(dsmall$n_tokens_content)

hist(dsmall$n_unique_tokens)

hist(dsmall$n_non_stop_words)

hist(dsmall$n_non_stop_unique_tokens)

hist(dsmall$num_hrefs)

hist(dsmall$num_self_hrefs)

hist(dsmall$num_imgs)

hist(dsmall$num_videos)

hist(dsmall$average_token_length)

hist(dsmall$num_keywords)

hist(dsmall$kw_min_min)

hist(dsmall$kw_max_min)

hist(dsmall$kw_avg_min)

hist(dsmall$kw_min_max)

hist(dsmall$kw_max_max)

hist(dsmall$kw_avg_max)

hist(dsmall$kw_min_avg)

hist(dsmall$kw_max_avg)

hist(dsmall$kw_avg_avg)

hist(dsmall$self_reference_min_shares)

hist(dsmall$self_reference_max_shares)

hist(dsmall$self_reference_avg_sharess)

hist(dsmall$LDA_00)

hist(dsmall$LDA_01)

hist(dsmall$LDA_02)

hist(dsmall$LDA_03)

hist(dsmall$LDA_04)

hist(dsmall$global_subjectivity)

hist(dsmall$global_sentiment_polarity)

hist(dsmall$global_rate_positive_words)

hist(dsmall$global_rate_negative_words)

hist(dsmall$rate_positive_words)

hist(dsmall$rate_negative_words)

hist(dsmall$avg_positive_polarity)

hist(dsmall$min_positive_polarity)

hist(dsmall$max_positive_polarity)

hist(dsmall$avg_negative_polarity)

hist(dsmall$min_negative_polarity)

hist(dsmall$max_negative_polarity)

hist(dsmall$title_subjectivity)

hist(dsmall$title_sentiment_polarity)

hist(dsmall$abs_title_subjectivity)

hist(dsmall$abs_title_sentiment_polarity)

qplot(shares, data=dsmall, geom="histogram",xlim=c(0,40000),ylim=c(0,1500))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 63 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

As a main conclusion, we confirm that the number of shares is typically low, with only a small number of articles reaching a very high share count.
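Given this skew, a log scale makes the shape of the distribution much easier to read. A minimal sketch (assuming the dsmall sample created above; log1p guards against any zero counts):

```r
# The heavy right tail compresses the raw histogram; log1p(x) = log(1 + x)
# spreads out the bulk of the data and is safe even if a count were 0.
hist(log1p(dsmall$shares), breaks = 40,
     main = "Histogram of log1p(shares)", xlab = "log1p(shares)")

# Quantiles quantify the tail directly:
quantile(dsmall$shares, probs = c(0.50, 0.90, 0.99, 1.00))
```

A log transform of the target is also a common remedy when fitting linear models to heavy-tailed data like this.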

Identify Regression Variables

Since we have a large number of variables, we will start by creating different variable categories and performing a regression analysis on each block. We are doing this to reduce the computational time of running the analysis with a very large number of variables, and also to better understand the types of variables in the dataset.

We will pick the best variables of each block and take them into the final pool of variables for running a global model.

Example: create a Words category that includes all word-related variables: tokens in the title, tokens in the content, non-stop words, etc.

Here we will create datasets for all the categories with the variables:

words = dsmall %>%
  select(n_tokens_title, n_tokens_content, n_unique_tokens, n_non_stop_words, n_non_stop_unique_tokens, average_token_length, shares)

links = dsmall %>%  select( num_hrefs, num_self_hrefs,self_reference_min_shares,
 self_reference_max_shares, self_reference_avg_sharess, shares)

media = dsmall %>%
  select(num_imgs,num_videos,shares)

time = dsmall %>%
  select( weekday_is_monday, weekday_is_tuesday, weekday_is_wednesday, weekday_is_thursday, weekday_is_friday, weekday_is_saturday, weekday_is_sunday, is_weekend, shares)

keywords = dsmall %>%
  select( num_keywords, data_channel_is_lifestyle, data_channel_is_entertainment, data_channel_is_bus, data_channel_is_socmed, data_channel_is_tech, data_channel_is_world, kw_min_min, kw_max_min, kw_avg_min, kw_min_max, kw_max_max, kw_avg_max, kw_min_avg, kw_max_avg, kw_avg_avg, shares)
  
nlp = dsmall %>%
  select( LDA_00, LDA_01, LDA_02, LDA_03, LDA_04, global_subjectivity, global_sentiment_polarity, global_rate_positive_words, global_rate_negative_words, rate_positive_words, rate_negative_words, avg_positive_polarity, min_positive_polarity, max_positive_polarity, avg_negative_polarity, min_negative_polarity, max_negative_polarity, title_subjectivity, title_sentiment_polarity, abs_title_subjectivity, shares)

We now have six categories: Words, Links, Media, Time, Keywords, and NLP (natural language processing).

We will now run the regression method on each of the categories in order to determine the most significant variables and discard the least significant ones.

We will be using a maximum tolerable p-value of 5% to select the variables that we would like to keep for building the global model.
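This selection rule can be made mechanical with a small backward-elimination helper. The function below is not part of the original analysis, just a sketch: it refits lm(), drops any aliased (NA) coefficients first, then removes the weakest predictor until every remaining p-value is at or below the threshold. It assumes all predictors are numeric columns of the data frame, which is true for every block here.

```r
# Backward elimination by p-value (hypothetical helper, alpha = 0.05).
prune_by_pvalue <- function(df, response = "shares", alpha = 0.05) {
  preds <- setdiff(names(df), response)
  repeat {
    fit <- lm(reformulate(preds, response), data = df)
    coefs <- summary(fit)$coefficients
    # Aliased predictors (NA estimates, e.g. perfectly collinear dummies)
    # never appear in the coefficient table: drop them outright.
    aliased <- setdiff(preds, rownames(coefs))
    if (length(aliased) > 0) { preds <- setdiff(preds, aliased); next }
    p <- coefs[rownames(coefs) != "(Intercept)", 4]
    if (max(p) <= alpha || length(preds) == 1) return(fit)
    preds <- setdiff(preds, names(p)[which.max(p)])
  }
}

# e.g. summary(prune_by_pvalue(words)) for the Words block
```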

Block: Words

m1 = lm(data = words, shares ~ .)
summary(m1)
## 
## Call:
## lm(formula = shares ~ ., data = words)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5445  -2238  -1533   -324 648776 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               4.731e+03  7.691e+02   6.151 7.99e-10 ***
## n_tokens_title            1.779e+01  4.633e+01   0.384   0.7010    
## n_tokens_content          7.600e-01  3.265e-01   2.327   0.0200 *  
## n_unique_tokens           1.585e+04  2.637e+03   6.010 1.92e-09 ***
## n_non_stop_words          4.116e+03  2.137e+03   1.926   0.0541 .  
## n_non_stop_unique_tokens -1.159e+04  2.237e+03  -5.181 2.25e-07 ***
## average_token_length     -1.479e+03  3.661e+02  -4.041 5.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9703 on 9993 degrees of freedom
## Multiple R-squared:  0.005427,   Adjusted R-squared:  0.00483 
## F-statistic: 9.087 on 6 and 9993 DF,  p-value: 6.162e-10
#Remove the variables with a p-value > 0.05 and re-run the model
summary(lm(data=words, shares~n_unique_tokens+ n_non_stop_unique_tokens+ average_token_length+ n_tokens_content ))
## 
## Call:
## lm(formula = shares ~ n_unique_tokens + n_non_stop_unique_tokens + 
##     average_token_length + n_tokens_content, data = words)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6104  -2218  -1559   -345 648999 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               5227.9090   551.7818   9.475  < 2e-16 ***
## n_unique_tokens          15420.7248  2628.0951   5.868 4.56e-09 ***
## n_non_stop_unique_tokens -9929.6075  2068.4728  -4.800 1.61e-06 ***
## average_token_length      -888.5832   193.7034  -4.587 4.54e-06 ***
## n_tokens_content             0.9554     0.3109   3.073  0.00212 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9704 on 9995 degrees of freedom
## Multiple R-squared:  0.005033,   Adjusted R-squared:  0.004635 
## F-statistic: 12.64 on 4 and 9995 DF,  p-value: 2.921e-10
We will keep the following variables for the model: n_unique_tokens, n_non_stop_unique_tokens, average_token_length, n_tokens_content.

Block: Media

m3 = lm(data = media, shares ~ .)
summary(m3)
## 
## Call:
## lm(formula = shares ~ ., data = media)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5929  -2199  -1712   -449 649826 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2970.84     115.42  25.738  < 2e-16 ***
## num_imgs       39.22      11.70   3.352 0.000804 ***
## num_videos     51.55      23.97   2.150 0.031565 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9720 on 9997 degrees of freedom
## Multiple R-squared:  0.001485,   Adjusted R-squared:  0.001285 
## F-statistic: 7.433 on 2 and 9997 DF,  p-value: 0.0005947

Both variables have a p-value < 0.05, so we will keep them both.

Block: Time

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
m4 = lm(data = time, shares ~ .)
summary(m4)
## 
## Call:
## lm(formula = shares ~ ., data = time)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3760  -2230  -1758   -495 649362 
## 
## Coefficients: (2 not defined because of singularities)
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3706.6      368.1  10.069   <2e-16 ***
## weekday_is_monday      -168.4      437.5  -0.385   0.7003    
## weekday_is_tuesday     -711.3      430.6  -1.652   0.0986 .  
## weekday_is_wednesday   -738.5      430.9  -1.714   0.0866 .  
## weekday_is_thursday    -682.0      433.5  -1.573   0.1157    
## weekday_is_friday      -549.0      450.5  -1.219   0.2230    
## weekday_is_saturday     107.1      535.8   0.200   0.8415    
## weekday_is_sunday          NA         NA      NA       NA    
## is_weekend                 NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9725 on 9993 degrees of freedom
## Multiple R-squared:  0.0008936,  Adjusted R-squared:  0.0002937 
## F-statistic:  1.49 on 6 and 9993 DF,  p-value: 0.1772
#Since many of the variables are not significant, we run forward stepwise selection (stepAIC) to identify the most significant ones.

null= lm(data=time, shares ~ 1)  
full = lm(data=time, shares ~ .) 

step = stepAIC(null, scope=list(lower=null, upper=full), direction = "forward")
## Start:  AIC=183653.2
## shares ~ 1
## 
##                        Df Sum of Sq        RSS    AIC
## + is_weekend            1 454744773 9.4550e+11 183650
## + weekday_is_saturday   1 241926056 9.4571e+11 183653
## + weekday_is_monday     1 218323983 9.4574e+11 183653
## <none>                              9.4596e+11 183653
## + weekday_is_sunday     1 184453042 9.4577e+11 183653
## + weekday_is_wednesday  1 136829546 9.4582e+11 183654
## + weekday_is_tuesday    1 108469518 9.4585e+11 183654
## + weekday_is_thursday   1  76359028 9.4588e+11 183654
## + weekday_is_friday     1   4618178 9.4595e+11 183655
## 
## Step:  AIC=183650.4
## shares ~ is_weekend
## 
##                        Df Sum of Sq        RSS    AIC
## + weekday_is_monday     1 354428670 9.4515e+11 183649
## <none>                              9.4550e+11 183650
## + weekday_is_wednesday  1  61241203 9.4544e+11 183652
## + weekday_is_tuesday    1  42360690 9.4546e+11 183652
## + weekday_is_thursday   1  24160547 9.4548e+11 183652
## + weekday_is_saturday   1   3780575 9.4550e+11 183652
## + weekday_is_sunday     1   3780575 9.4550e+11 183652
## + weekday_is_friday     1   1506309 9.4550e+11 183652
## 
## Step:  AIC=183648.6
## shares ~ is_weekend + weekday_is_monday
## 
##                        Df Sum of Sq        RSS    AIC
## <none>                              9.4515e+11 183649
## + weekday_is_friday     1  29402354 9.4512e+11 183650
## + weekday_is_wednesday  1   9298260 9.4514e+11 183651
## + weekday_is_saturday   1   3780575 9.4514e+11 183651
## + weekday_is_sunday     1   3780575 9.4514e+11 183651
## + weekday_is_tuesday    1   2790474 9.4514e+11 183651
## + weekday_is_thursday   1     29884 9.4515e+11 183651
#We will run the model again with the chosen variables
summary(lm(data = time , shares ~ is_weekend + weekday_is_monday ))
## 
## Call:
## lm(formula = shares ~ is_weekend + weekday_is_monday, data = time)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3703  -2238  -1728   -528 649362 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3028.1      116.3  26.030   <2e-16 ***
## is_weekend           729.0      291.6   2.500   0.0124 *  
## weekday_is_monday    510.1      263.5   1.936   0.0529 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9723 on 9997 degrees of freedom
## Multiple R-squared:  0.0008554,  Adjusted R-squared:  0.0006555 
## F-statistic: 4.279 on 2 and 9997 DF,  p-value: 0.01388
#is_weekend is significant (p = 0.012) while weekday_is_monday is only marginal (p = 0.053), so we drop weekday_is_monday and keep is_weekend
detach("package:MASS")
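As an aside, the forward-selection pattern used above (an intercept-only null model, a full model, and stepAIC searching between them) can be illustrated end-to-end on simulated data. The data frame `sim` and variables `x1`–`x3` below are made up for illustration and are not part of the Mashable dataset.

```r
# Minimal sketch of forward stepwise selection with stepAIC (MASS),
# on simulated data: x1 and x2 drive y, x3 is pure noise.
library(MASS)

set.seed(42)
sim <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
sim$y <- 2 * sim$x1 + 0.5 * sim$x2 + rnorm(200)

null_m <- lm(y ~ 1, data = sim)   # start: intercept only
full_m <- lm(y ~ ., data = sim)   # upper bound of the search

# Add one term at a time, keeping each addition only if it lowers AIC
sel <- stepAIC(null_m, scope = list(lower = null_m, upper = full_m),
               direction = "forward", trace = FALSE)
formula(sel)   # the informative predictors are picked up
```

Setting trace = FALSE suppresses the step-by-step AIC tables shown above; leave it at the default to see the same kind of log.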

Block: Keywords

m5 = lm(data = keywords, shares ~ .)
summary(m5)
## 
## Call:
## lm(formula = shares ~ ., data = keywords)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -23591  -2002  -1224   -214 647794 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -3.590e+01  9.645e+02  -0.037 0.970308    
## num_keywords                   6.024e+01  6.061e+01   0.994 0.320310    
## data_channel_is_lifestyle     -9.323e+02  5.189e+02  -1.797 0.072384 .  
## data_channel_is_entertainment -1.255e+03  3.923e+02  -3.200 0.001377 ** 
## data_channel_is_bus           -4.875e+02  4.013e+02  -1.215 0.224505    
## data_channel_is_socmed        -3.775e+02  5.103e+02  -0.740 0.459495    
## data_channel_is_tech          -5.829e+02  4.050e+02  -1.439 0.150136    
## data_channel_is_world         -9.237e+02  4.215e+02  -2.191 0.028441 *  
## kw_min_min                     1.732e+00  2.669e+00   0.649 0.516375    
## kw_max_min                     2.529e-02  8.650e-02   0.292 0.769974    
## kw_avg_min                    -3.689e-01  4.490e-01  -0.822 0.411305    
## kw_min_max                    -3.647e-03  1.851e-03  -1.970 0.048847 *  
## kw_max_max                    -3.023e-04  9.414e-04  -0.321 0.748140    
## kw_avg_max                     1.320e-03  1.304e-03   1.012 0.311367    
## kw_min_avg                    -4.526e-01  1.230e-01  -3.679 0.000236 ***
## kw_max_avg                    -1.706e-01  3.817e-02  -4.468 7.98e-06 ***
## kw_avg_avg                     1.574e+00  2.217e-01   7.096 1.37e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9632 on 9983 degrees of freedom
## Multiple R-squared:  0.02092,    Adjusted R-squared:  0.01935 
## F-statistic: 13.33 on 16 and 9983 DF,  p-value: < 2.2e-16

Since this block contains many variables, we split it further into two sub-blocks: type (the data-channel indicators) and keys (the keyword statistics).

type = dsmall %>%
  select(data_channel_is_lifestyle, data_channel_is_entertainment, data_channel_is_bus, data_channel_is_socmed, data_channel_is_tech, data_channel_is_world, shares)

m5 = lm(data = type, shares ~ .)
summary(m5)
## 
## Call:
## lm(formula = shares ~ ., data = type)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5457  -1988  -1397   -339 649673 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     5478.6      247.2  22.165  < 2e-16 ***
## data_channel_is_lifestyle      -2239.4      481.9  -4.647 3.41e-06 ***
## data_channel_is_entertainment  -2881.7      336.8  -8.556  < 2e-16 ***
## data_channel_is_bus            -2251.6      348.9  -6.453 1.15e-10 ***
## data_channel_is_socmed         -2015.7      467.6  -4.311 1.64e-05 ***
## data_channel_is_tech           -2556.8      333.8  -7.659 2.05e-14 ***
## data_channel_is_world          -3217.5      323.9  -9.935  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9674 on 9993 degrees of freedom
## Multiple R-squared:  0.01128,    Adjusted R-squared:  0.01069 
## F-statistic: 19.01 on 6 and 9993 DF,  p-value: < 2.2e-16

From m5, all variables now have p-values below 0.05, so we keep all of them.
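Since this "keep the variables with p &lt; 0.05" step recurs throughout the analysis, it can also be done programmatically instead of by reading summary() tables. A minimal sketch on simulated data (the names `a`, `b`, `c` are illustrative, not dataset columns):

```r
# Pull coefficient p-values out of a fitted lm and filter on them.
set.seed(1)
d <- data.frame(a = rnorm(300), b = rnorm(300), c = rnorm(300))
d$y <- 3 * d$a + rnorm(300)              # only `a` truly matters here

m <- lm(y ~ ., data = d)
pvals <- coef(summary(m))[, "Pr(>|t|)"]  # named numeric vector, one per term
keep <- setdiff(names(pvals)[pvals < 0.05], "(Intercept)")
keep                                      # predictors passing the 0.05 cutoff
```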

keys = dsmall %>%
  select(kw_min_min, kw_max_min, kw_avg_min, kw_min_max, kw_max_max, kw_avg_max, kw_min_avg, kw_max_avg, kw_avg_avg, shares)

m6 = lm(data = keys, shares ~ .)
summary(m6)
## 
## Call:
## lm(formula = shares ~ ., data = keys)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -26141  -2050  -1272   -180 647700 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.987e+02  7.438e+02  -0.805   0.4209    
## kw_min_min   2.079e+00  2.667e+00   0.779   0.4357    
## kw_max_min   3.457e-02  8.555e-02   0.404   0.6862    
## kw_avg_min  -4.473e-01  4.407e-01  -1.015   0.3100    
## kw_min_max  -4.369e-03  1.822e-03  -2.397   0.0165 *  
## kw_max_max  -6.686e-04  9.088e-04  -0.736   0.4619    
## kw_avg_max   1.845e-03  1.097e-03   1.681   0.0928 .  
## kw_min_avg  -5.389e-01  1.160e-01  -4.647 3.41e-06 ***
## kw_max_avg  -2.017e-01  3.440e-02  -5.862 4.72e-09 ***
## kw_avg_avg   1.811e+00  1.832e-01   9.887  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9636 on 9990 degrees of freedom
## Multiple R-squared:  0.01942,    Adjusted R-squared:  0.01853 
## F-statistic: 21.98 on 9 and 9990 DF,  p-value: < 2.2e-16

From m6, the clearly significant variables are kw_min_avg, kw_max_avg, and kw_avg_avg, with kw_min_max borderline (p ≈ 0.017).

We will run a reduced model with kw_max_min, kw_avg_min, kw_max_avg, and kw_avg_avg and see if we can trim it further.

summary(lm(data = keys, shares ~ kw_max_min+ kw_avg_min+ kw_max_avg+ kw_avg_avg))
## 
## Call:
## lm(formula = shares ~ kw_max_min + kw_avg_min + kw_max_avg + 
##     kw_avg_avg, data = keys)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24921  -2111  -1355   -224 648275 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -133.81918  295.31594  -0.453    0.650    
## kw_max_min    -0.02590    0.08235  -0.314    0.753    
## kw_avg_min    -0.07703    0.41447  -0.186    0.853    
## kw_max_avg    -0.12252    0.02805  -4.368 1.27e-05 ***
## kw_avg_avg     1.30405    0.12445  10.478  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9651 on 9995 degrees of freedom
## Multiple R-squared:  0.01581,    Adjusted R-squared:  0.01542 
## F-statistic: 40.15 on 4 and 9995 DF,  p-value: < 2.2e-16
#Rerun the model by removing the variables with p-value > 0.05
summary(lm(data = keys, shares ~ kw_max_avg+ kw_avg_avg))
## 
## Call:
## lm(formula = shares ~ kw_max_avg + kw_avg_avg, data = keys)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27910  -2109  -1357   -234 648185 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -131.49908  292.24325  -0.450    0.653    
## kw_max_avg    -0.13357    0.02667  -5.009 5.58e-07 ***
## kw_avg_avg     1.30595    0.12443  10.495  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9651 on 9997 degrees of freedom
## Multiple R-squared:  0.01565,    Adjusted R-squared:  0.01545 
## F-statistic: 79.46 on 2 and 9997 DF,  p-value: < 2.2e-16

From the model above, we will keep kw_max_avg and kw_avg_avg.
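Pruning by individual p-values can be cross-checked with a partial F-test that compares the reduced and fuller models directly. A sketch with simulated data (the variable names are illustrative, not dataset columns):

```r
# Partial F-test via anova(): does the fuller model explain significantly
# more variance than the reduced one?
set.seed(7)
d <- data.frame(x1 = rnorm(250), x2 = rnorm(250), x3 = rnorm(250))
d$y <- 1.5 * d$x1 + rnorm(250)           # x2 and x3 carry no signal

reduced <- lm(y ~ x1, data = d)
fuller  <- lm(y ~ x1 + x2 + x3, data = d)

# A large p-value in the second row means the extra terms add nothing useful,
# which supports keeping the reduced model.
anova(reduced, fuller)
```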

Block: NLP (Natural language processing)

m7 = lm(data = nlp, shares ~ .)
summary(m7)
## 
## Call:
## lm(formula = shares ~ ., data = nlp)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7912  -2269  -1279   -116 647368 
## 
## Coefficients: (1 not defined because of singularities)
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 3528.01     709.16   4.975 6.64e-07 ***
## LDA_00                      1078.88     453.45   2.379 0.017366 *  
## LDA_01                      -633.20     511.96  -1.237 0.216187    
## LDA_02                      -987.42     453.11  -2.179 0.029338 *  
## LDA_03                      1848.59     426.07   4.339 1.45e-05 ***
## LDA_04                           NA         NA      NA       NA    
## global_subjectivity         6566.84    1378.34   4.764 1.92e-06 ***
## global_sentiment_polarity   1989.55    2766.66   0.719 0.472085    
## global_rate_positive_words -9734.96   11824.99  -0.823 0.410384    
## global_rate_negative_words 12875.47   22951.96   0.561 0.574827    
## rate_positive_words        -3794.68    1069.25  -3.549 0.000389 ***
## rate_negative_words        -3677.28    1629.28  -2.257 0.024030 *  
## avg_positive_polarity      -2125.51    2270.23  -0.936 0.349165    
## min_positive_polarity       1016.15    1817.12   0.559 0.576030    
## max_positive_polarity       -583.57     685.25  -0.852 0.394444    
## avg_negative_polarity      -2525.79    2077.39  -1.216 0.224071    
## min_negative_polarity        145.44     708.91   0.205 0.837446    
## max_negative_polarity        177.77    1725.69   0.103 0.917953    
## title_subjectivity            87.08     356.31   0.244 0.806926    
## title_sentiment_polarity    1029.88     388.14   2.653 0.007982 ** 
## abs_title_subjectivity      1396.15     609.85   2.289 0.022079 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9663 on 9980 degrees of freedom
## Multiple R-squared:  0.0149, Adjusted R-squared:  0.01303 
## F-statistic: 7.946 on 19 and 9980 DF,  p-value: < 2.2e-16
#Since this category also has many variables, we divide it further into sub-categories.
#Sub-category 1: the LDA topic proportions
lda = dsmall %>%
  select(LDA_00, LDA_01, LDA_02, LDA_03, LDA_04, shares)
  
m8 = lm(data = lda, shares ~ .)
summary(m8)
## 
## Call:
## lm(formula = shares ~ ., data = lda)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5015  -2190  -1349   -306 649250 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2814.5      275.2  10.226  < 2e-16 ***
## LDA_00         874.0      449.2   1.945  0.05175 .  
## LDA_01        -484.9      501.5  -0.967  0.33358    
## LDA_02       -1211.1      432.4  -2.801  0.00511 ** 
## LDA_03        2540.1      402.6   6.310 2.91e-10 ***
## LDA_04            NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9683 on 9995 degrees of freedom
## Multiple R-squared:  0.009271,   Adjusted R-squared:  0.008874 
## F-statistic: 23.38 on 4 and 9995 DF,  p-value: < 2.2e-16
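The NA for LDA_04 is not missing data: the five LDA topic proportions sum to 1 for every article, so any one of them is an exact linear combination of the intercept and the other four, and lm() drops it as aliased. A minimal sketch reproducing the effect with made-up proportions `p1`–`p3`:

```r
# Three columns that sum to 1 in every row: one coefficient is not estimable.
set.seed(3)
p1 <- runif(100)
p2 <- runif(100) * (1 - p1)
p3 <- 1 - p1 - p2                 # exact linear dependence on p1, p2
d  <- data.frame(p1, p2, p3)
d$y <- 2 * d$p1 - d$p2 + rnorm(100)

m <- lm(y ~ p1 + p2 + p3, data = d)
coef(m)    # the coefficient for p3 comes back NA (aliased)
alias(m)   # reports the dependency: p3 = 1 - p1 - p2
```

Dropping any one of the proportions removes the dependency, which is why removing some LDA columns below makes the NA disappear.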
#LDA_04 is NA because the five LDA proportions are perfectly collinear (they sum to 1); we remove LDA_00 and LDA_01 (the highest p-values) and rerun
lda = dsmall %>%
  select(LDA_02, LDA_03, LDA_04, shares)
 
m9 = lm(data = lda, shares ~ .)
summary(m9)
## 
## Call:
## lm(formula = shares ~ ., data = lda)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4970  -2195  -1360   -274 649454 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3132.9      232.1  13.496  < 2e-16 ***
## LDA_02       -1531.6      395.6  -3.871 0.000109 ***
## LDA_03        2165.7      393.1   5.509 3.69e-08 ***
## LDA_04        -313.2      391.8  -0.799 0.424055    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9686 on 9996 degrees of freedom
## Multiple R-squared:  0.008627,   Adjusted R-squared:  0.008329 
## F-statistic:    29 on 3 and 9996 DF,  p-value: < 2.2e-16

We run a model with only LDA_02 and LDA_03 to check whether both remain significant.

summary(lm(data = lda, shares~ LDA_02+ LDA_03))
## 
## Call:
## lm(formula = shares ~ LDA_02 + LDA_03, data = lda)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4981  -2175  -1367   -306 649551 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2998.6      160.2  18.713  < 2e-16 ***
## LDA_02       -1406.2      363.2  -3.872 0.000109 ***
## LDA_03        2315.2      345.8   6.695 2.28e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9686 on 9997 degrees of freedom
## Multiple R-squared:  0.008564,   Adjusted R-squared:  0.008365 
## F-statistic: 43.17 on 2 and 9997 DF,  p-value: < 2.2e-16

From the results, we will keep both LDA_02 and LDA_03.

#Sub-category 2: subjectivity and polarity measures
pol = dsmall %>%
  select(global_subjectivity, global_sentiment_polarity, global_rate_positive_words, global_rate_negative_words, rate_positive_words, rate_negative_words, avg_positive_polarity, min_positive_polarity, max_positive_polarity, avg_negative_polarity, min_negative_polarity, max_negative_polarity, title_subjectivity, title_sentiment_polarity, abs_title_subjectivity, shares)

m10 = lm(data = pol, shares ~ .)
summary(m10)
## 
## Call:
## lm(formula = shares ~ ., data = pol)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7467  -2276  -1419   -219 647607 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  4230.9      633.5   6.678 2.55e-11 ***
## global_subjectivity          7699.0     1360.6   5.659 1.57e-08 ***
## global_sentiment_polarity    2235.7     2768.6   0.808  0.41939    
## global_rate_positive_words  -9695.2    11784.2  -0.823  0.41068    
## global_rate_negative_words  21857.9    22758.9   0.960  0.33687    
## rate_positive_words         -5224.0     1036.8  -5.039 4.77e-07 ***
## rate_negative_words         -5850.4     1577.6  -3.708  0.00021 ***
## avg_positive_polarity       -1300.1     2269.3  -0.573  0.56672    
## min_positive_polarity        1376.7     1807.3   0.762  0.44620    
## max_positive_polarity        -543.7      686.1  -0.792  0.42810    
## avg_negative_polarity       -3689.5     2056.2  -1.794  0.07279 .  
## min_negative_polarity         534.1      705.1   0.758  0.44875    
## max_negative_polarity         244.2     1726.9   0.141  0.88756    
## title_subjectivity            177.5      356.0   0.499  0.61810    
## title_sentiment_polarity     1102.4      388.9   2.835  0.00460 ** 
## abs_title_subjectivity       1521.6      610.1   2.494  0.01265 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9686 on 9984 degrees of freedom
## Multiple R-squared:  0.009781,   Adjusted R-squared:  0.008293 
## F-statistic: 6.574 on 15 and 9984 DF,  p-value: 2.864e-14
#Again, run the model with the variables that have p-value < 0.05.

pol2 = dsmall %>%
  select(global_subjectivity, rate_positive_words, rate_negative_words,  avg_negative_polarity, title_sentiment_polarity, abs_title_subjectivity, shares)

m11 = lm(data = pol2, shares ~ .)
summary(m11)
## 
## Call:
## lm(formula = shares ~ ., data = pol2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6304  -2288  -1439   -221 647871 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                4337.5      599.2   7.238 4.87e-13 ***
## global_subjectivity        7921.9     1172.2   6.758 1.48e-11 ***
## rate_positive_words       -5993.6      804.3  -7.452 9.98e-14 ***
## rate_negative_words       -5788.2      895.3  -6.465 1.06e-10 ***
## avg_negative_polarity     -2182.2      903.5  -2.415  0.01574 *  
## title_sentiment_polarity   1090.2      381.9   2.855  0.00431 ** 
## abs_title_subjectivity     1375.2      532.5   2.582  0.00983 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9684 on 9993 degrees of freedom
## Multiple R-squared:  0.009223,   Adjusted R-squared:  0.008628 
## F-statistic:  15.5 on 6 and 9993 DF,  p-value: < 2.2e-16

For this sub-category, we keep global_subjectivity, rate_positive_words, rate_negative_words, avg_negative_polarity, title_sentiment_polarity, and abs_title_subjectivity.

Build Global Regression Model

We now use stepwise selection to identify the most significant variables among those retained from the block-by-block analysis.

#Create a new dataset with the variables that we have selected
regData = dsmall %>%
  select(n_unique_tokens, n_non_stop_unique_tokens, average_token_length, n_tokens_content, num_hrefs, num_self_hrefs, self_reference_min_shares, num_imgs, num_videos, is_weekend , kw_max_avg, kw_avg_avg, LDA_03, LDA_02, global_subjectivity, rate_positive_words, rate_negative_words,  avg_negative_polarity, title_sentiment_polarity, abs_title_subjectivity, shares)

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
null = lm(data = regData, shares ~ 1)
full = lm(data = regData, shares ~ .)

step = stepAIC(null, scope=list(lower=null, upper=full), direction = "forward")
## Start:  AIC=183653.2
## shares ~ 1
## 
##                             Df  Sum of Sq        RSS    AIC
## + self_reference_min_shares  1 1.7624e+10 9.2833e+11 183467
## + kw_avg_avg                 1 1.2466e+10 9.3349e+11 183523
## + LDA_03                     1 6.6944e+09 9.3926e+11 183584
## + kw_max_avg                 1 4.5429e+09 9.4141e+11 183607
## + LDA_02                     1 3.8961e+09 9.4206e+11 183614
## + global_subjectivity        1 1.6742e+09 9.4428e+11 183637
## + average_token_length       1 1.3076e+09 9.4465e+11 183641
## + num_hrefs                  1 1.2795e+09 9.4468e+11 183642
## + avg_negative_polarity      1 9.9203e+08 9.4496e+11 183645
## + num_imgs                   1 9.6777e+08 9.4499e+11 183645
## + title_sentiment_polarity   1 6.1864e+08 9.4534e+11 183649
## + is_weekend                 1 4.5474e+08 9.4550e+11 183650
## + n_tokens_content           1 3.4937e+08 9.4561e+11 183652
## + num_videos                 1 3.4267e+08 9.4561e+11 183652
## + abs_title_subjectivity     1 2.8085e+08 9.4567e+11 183652
## + n_non_stop_unique_tokens   1 2.6692e+08 9.4569e+11 183652
## + rate_positive_words        1 2.4199e+08 9.4571e+11 183653
## <none>                                    9.4596e+11 183653
## + rate_negative_words        1 1.5979e+08 9.4580e+11 183654
## + num_self_hrefs             1 3.4131e+07 9.4592e+11 183655
## + n_unique_tokens            1 4.3950e+06 9.4595e+11 183655
## 
## Step:  AIC=183467.1
## shares ~ self_reference_min_shares
## 
##                            Df  Sum of Sq        RSS    AIC
## + kw_avg_avg                1 9261479023 9.1907e+11 183369
## + LDA_03                    1 6072682385 9.2226e+11 183404
## + LDA_02                    1 3230087597 9.2510e+11 183434
## + kw_max_avg                1 2607956330 9.2572e+11 183441
## + average_token_length      1 1636550645 9.2669e+11 183451
## + num_hrefs                 1 1301601281 9.2703e+11 183455
## + global_subjectivity       1 1173555083 9.2716e+11 183456
## + num_imgs                  1  912920403 9.2742e+11 183459
## + avg_negative_polarity     1  772751755 9.2756e+11 183461
## + title_sentiment_polarity  1  631614916 9.2770e+11 183462
## + n_non_stop_unique_tokens  1  480894368 9.2785e+11 183464
## + is_weekend                1  480679156 9.2785e+11 183464
## + num_videos                1  402943358 9.2793e+11 183465
## + abs_title_subjectivity    1  306088646 9.2803e+11 183466
## + rate_positive_words       1  304643514 9.2803e+11 183466
## + rate_negative_words       1  232618574 9.2810e+11 183467
## + n_tokens_content          1  195498759 9.2814e+11 183467
## <none>                                   9.2833e+11 183467
## + n_unique_tokens           1   21773542 9.2831e+11 183469
## + num_self_hrefs            1    4486089 9.2833e+11 183469
## 
## Step:  AIC=183368.9
## shares ~ self_reference_min_shares + kw_avg_avg
## 
##                            Df  Sum of Sq        RSS    AIC
## + kw_max_avg                1 2891029781 9.1618e+11 183339
## + LDA_03                    1 1819256044 9.1725e+11 183351
## + LDA_02                    1 1144589054 9.1793e+11 183358
## + average_token_length      1  855659040 9.1821e+11 183362
## + global_subjectivity       1  737201041 9.1833e+11 183363
## + num_hrefs                 1  519432895 9.1855e+11 183365
## + title_sentiment_polarity  1  424538302 9.1865e+11 183366
## + avg_negative_polarity     1  414657026 9.1866e+11 183366
## + abs_title_subjectivity    1  381084214 9.1869e+11 183367
## + num_imgs                  1  309000951 9.1876e+11 183368
## + is_weekend                1  298413233 9.1877e+11 183368
## <none>                                   9.1907e+11 183369
## + n_non_stop_unique_tokens  1  158908780 9.1891e+11 183369
## + rate_negative_words       1  130615678 9.1894e+11 183369
## + rate_positive_words       1   99576521 9.1897e+11 183370
## + num_videos                1   84893840 9.1898e+11 183370
## + n_tokens_content          1   50381270 9.1902e+11 183370
## + n_unique_tokens           1    4617967 9.1907e+11 183371
## + num_self_hrefs            1    3451655 9.1907e+11 183371
## 
## Step:  AIC=183339.4
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg
## 
##                            Df Sum of Sq        RSS    AIC
## + LDA_03                    1 754531099 9.1542e+11 183333
## + global_subjectivity       1 648783454 9.1553e+11 183334
## + average_token_length      1 532539330 9.1565e+11 183336
## + LDA_02                    1 522478455 9.1566e+11 183336
## + num_hrefs                 1 424432441 9.1575e+11 183337
## + title_sentiment_polarity  1 400275352 9.1578e+11 183337
## + abs_title_subjectivity    1 372119630 9.1581e+11 183337
## + avg_negative_polarity     1 349069360 9.1583e+11 183338
## + is_weekend                1 259694181 9.1592e+11 183339
## <none>                                  9.1618e+11 183339
## + num_imgs                  1 165271159 9.1601e+11 183340
## + rate_negative_words       1  81706676 9.1610e+11 183340
## + n_non_stop_unique_tokens  1  63746334 9.1612e+11 183341
## + rate_positive_words       1  39072809 9.1614e+11 183341
## + num_videos                1  29199118 9.1615e+11 183341
## + n_tokens_content          1  13697536 9.1617e+11 183341
## + num_self_hrefs            1   1942967 9.1618e+11 183341
## + n_unique_tokens           1    349580 9.1618e+11 183341
## 
## Step:  AIC=183333.1
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg + 
##     LDA_03
## 
##                            Df Sum of Sq        RSS    AIC
## + global_subjectivity       1 578956953 9.1485e+11 183329
## + average_token_length      1 401036239 9.1502e+11 183331
## + title_sentiment_polarity  1 394544061 9.1503e+11 183331
## + abs_title_subjectivity    1 371354003 9.1505e+11 183331
## + num_hrefs                 1 345774501 9.1508e+11 183331
## + LDA_02                    1 308393023 9.1512e+11 183332
## + avg_negative_polarity     1 266847149 9.1516e+11 183332
## + is_weekend                1 248037755 9.1518e+11 183332
## <none>                                  9.1542e+11 183333
## + rate_negative_words       1  88844302 9.1534e+11 183334
## + num_imgs                  1  86783248 9.1534e+11 183334
## + n_non_stop_unique_tokens  1  36702838 9.1539e+11 183335
## + rate_positive_words       1   9893647 9.1541e+11 183335
## + n_unique_tokens           1   2126776 9.1542e+11 183335
## + num_videos                1   1343418 9.1542e+11 183335
## + n_tokens_content          1    230405 9.1542e+11 183335
## + num_self_hrefs            1    189094 9.1542e+11 183335
## 
## Step:  AIC=183328.8
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg + 
##     LDA_03 + global_subjectivity
## 
##                            Df  Sum of Sq        RSS    AIC
## + average_token_length      1 1961546251 9.1288e+11 183309
## + n_non_stop_unique_tokens  1  482050530 9.1436e+11 183326
## + abs_title_subjectivity    1  374372331 9.1447e+11 183327
## + title_sentiment_polarity  1  368699153 9.1448e+11 183327
## + rate_positive_words       1  305436932 9.1454e+11 183327
## + is_weekend                1  236455112 9.1461e+11 183328
## + n_unique_tokens           1  213924854 9.1463e+11 183328
## + num_hrefs                 1  199983754 9.1465e+11 183329
## + LDA_02                    1  191180220 9.1465e+11 183329
## <none>                                   9.1485e+11 183329
## + rate_negative_words       1  158765158 9.1469e+11 183329
## + num_imgs                  1   56723615 9.1479e+11 183330
## + avg_negative_polarity     1   41222502 9.1480e+11 183330
## + n_tokens_content          1   15611257 9.1483e+11 183331
## + num_self_hrefs            1   10416129 9.1483e+11 183331
## + num_videos                1    6714064 9.1484e+11 183331
## 
## Step:  AIC=183309.3
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg + 
##     LDA_03 + global_subjectivity + average_token_length
## 
##                            Df Sum of Sq        RSS    AIC
## + num_hrefs                 1 525751997 9.1236e+11 183306
## + abs_title_subjectivity    1 448993607 9.1243e+11 183306
## + title_sentiment_polarity  1 273145684 9.1261e+11 183308
## + is_weekend                1 249367721 9.1263e+11 183309
## <none>                                  9.1288e+11 183309
## + avg_negative_polarity     1 140520049 9.1274e+11 183310
## + n_unique_tokens           1 101520064 9.1278e+11 183310
## + num_imgs                  1  98252035 9.1279e+11 183310
## + LDA_02                    1  51202913 9.1283e+11 183311
## + n_non_stop_unique_tokens  1   4832472 9.1288e+11 183311
## + num_videos                1   4433457 9.1288e+11 183311
## + rate_positive_words       1   1104713 9.1288e+11 183311
## + n_tokens_content          1    943086 9.1288e+11 183311
## + num_self_hrefs            1    588924 9.1288e+11 183311
## + rate_negative_words       1    529928 9.1288e+11 183311
## 
## Step:  AIC=183305.6
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg + 
##     LDA_03 + global_subjectivity + average_token_length + num_hrefs
## 
##                            Df Sum of Sq        RSS    AIC
## + n_unique_tokens           1 464109349 9.1189e+11 183302
## + abs_title_subjectivity    1 443667517 9.1191e+11 183303
## + title_sentiment_polarity  1 242749403 9.1212e+11 183305
## + is_weekend                1 195360065 9.1216e+11 183305
## <none>                                  9.1236e+11 183306
## + n_non_stop_unique_tokens  1 168143118 9.1219e+11 183306
## + n_tokens_content          1 149328442 9.1221e+11 183306
## + num_self_hrefs            1 129594404 9.1223e+11 183306
## + avg_negative_polarity     1 123432778 9.1223e+11 183306
## + LDA_02                    1  67786118 9.1229e+11 183307
## + num_videos                1  16085170 9.1234e+11 183307
## + num_imgs                  1   7728638 9.1235e+11 183307
## + rate_negative_words       1   1751509 9.1236e+11 183308
## + rate_positive_words       1    143432 9.1236e+11 183308
## 
## Step:  AIC=183302.5
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg + 
##     LDA_03 + global_subjectivity + average_token_length + num_hrefs + 
##     n_unique_tokens
## 
##                            Df Sum of Sq        RSS    AIC
## + abs_title_subjectivity    1 455384929 9.1144e+11 183299
## + title_sentiment_polarity  1 263450931 9.1163e+11 183302
## + is_weekend                1 198389536 9.1170e+11 183302
## <none>                                  9.1189e+11 183302
## + n_non_stop_unique_tokens  1 161904775 9.1173e+11 183303
## + avg_negative_polarity     1 142182718 9.1175e+11 183303
## + num_self_hrefs            1 115956730 9.1178e+11 183303
## + num_imgs                  1 103641992 9.1179e+11 183303
## + LDA_02                    1  35289104 9.1186e+11 183304
## + num_videos                1  21697807 9.1187e+11 183304
## + rate_positive_words       1   5565899 9.1189e+11 183304
## + rate_negative_words       1   4290101 9.1189e+11 183304
## + n_tokens_content          1   2395209 9.1189e+11 183304
## 
## Step:  AIC=183299.5
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg + 
##     LDA_03 + global_subjectivity + average_token_length + num_hrefs + 
##     n_unique_tokens + abs_title_subjectivity
## 
##                            Df Sum of Sq        RSS    AIC
## + title_sentiment_polarity  1 479539027 9.1096e+11 183296
## + is_weekend                1 212840430 9.1123e+11 183299
## <none>                                  9.1144e+11 183299
## + n_non_stop_unique_tokens  1 181204661 9.1126e+11 183300
## + avg_negative_polarity     1 139125824 9.1130e+11 183300
## + num_self_hrefs            1 116034695 9.1132e+11 183300
## + num_imgs                  1 111880959 9.1133e+11 183300
## + LDA_02                    1  50652389 9.1139e+11 183301
## + num_videos                1  17154284 9.1142e+11 183301
## + n_tokens_content          1   3471924 9.1144e+11 183301
## + rate_positive_words       1   2174158 9.1144e+11 183301
## + rate_negative_words       1   1694519 9.1144e+11 183301
## 
## Step:  AIC=183296.2
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg + 
##     LDA_03 + global_subjectivity + average_token_length + num_hrefs + 
##     n_unique_tokens + abs_title_subjectivity + title_sentiment_polarity
## 
##                            Df Sum of Sq        RSS    AIC
## + avg_negative_polarity     1 224796621 9.1073e+11 183296
## + is_weekend                1 210770015 9.1075e+11 183296
## <none>                                  9.1096e+11 183296
## + n_non_stop_unique_tokens  1 175387398 9.1078e+11 183296
## + num_self_hrefs            1 122646171 9.1084e+11 183297
## + num_imgs                  1 105435505 9.1085e+11 183297
## + LDA_02                    1  35399106 9.1092e+11 183298
## + rate_positive_words       1  33719598 9.1093e+11 183298
## + rate_negative_words       1  30981234 9.1093e+11 183298
## + num_videos                1  17249752 9.1094e+11 183298
## + n_tokens_content          1   3269191 9.1096e+11 183298
## 
## Step:  AIC=183295.8
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg + 
##     LDA_03 + global_subjectivity + average_token_length + num_hrefs + 
##     n_unique_tokens + abs_title_subjectivity + title_sentiment_polarity + 
##     avg_negative_polarity
## 
##                            Df Sum of Sq        RSS    AIC
## + n_non_stop_unique_tokens  1 210141605 9.1052e+11 183295
## + is_weekend                1 207824509 9.1053e+11 183295
## <none>                                  9.1073e+11 183296
## + num_self_hrefs            1 116082594 9.1062e+11 183296
## + num_imgs                  1 105366838 9.1063e+11 183297
## + LDA_02                    1  42919130 9.1069e+11 183297
## + num_videos                1  27455246 9.1071e+11 183297
## + rate_positive_words       1   4139532 9.1073e+11 183298
## + rate_negative_words       1   1742642 9.1073e+11 183298
## + n_tokens_content          1    752353 9.1073e+11 183298
## 
## Step:  AIC=183295.5
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg + 
##     LDA_03 + global_subjectivity + average_token_length + num_hrefs + 
##     n_unique_tokens + abs_title_subjectivity + title_sentiment_polarity + 
##     avg_negative_polarity + n_non_stop_unique_tokens
## 
##                       Df Sum of Sq        RSS    AIC
## + is_weekend           1 200004774 9.1032e+11 183295
## <none>                             9.1052e+11 183295
## + num_self_hrefs       1 102586799 9.1042e+11 183296
## + n_tokens_content     1  54690929 9.1047e+11 183297
## + num_imgs             1  45770362 9.1048e+11 183297
## + LDA_02               1  42049263 9.1048e+11 183297
## + num_videos           1  16449091 9.1051e+11 183297
## + rate_negative_words  1   6524555 9.1052e+11 183297
## + rate_positive_words  1    904379 9.1052e+11 183297
## 
## Step:  AIC=183295.3
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg + 
##     LDA_03 + global_subjectivity + average_token_length + num_hrefs + 
##     n_unique_tokens + abs_title_subjectivity + title_sentiment_polarity + 
##     avg_negative_polarity + n_non_stop_unique_tokens + is_weekend
## 
##                       Df Sum of Sq        RSS    AIC
## <none>                             9.1032e+11 183295
## + num_self_hrefs       1 107928560 9.1022e+11 183296
## + n_tokens_content     1  46036605 9.1028e+11 183297
## + LDA_02               1  45348893 9.1028e+11 183297
## + num_imgs             1  41377684 9.1028e+11 183297
## + num_videos           1  12287905 9.1031e+11 183297
## + rate_negative_words  1   5350473 9.1032e+11 183297
## + rate_positive_words  1    447435 9.1032e+11 183297
summary(step)
## 
## Call:
## lm(formula = shares ~ self_reference_min_shares + kw_avg_avg + 
##     kw_max_avg + LDA_03 + global_subjectivity + average_token_length + 
##     num_hrefs + n_unique_tokens + abs_title_subjectivity + title_sentiment_polarity + 
##     avg_negative_polarity + n_non_stop_unique_tokens + is_weekend, 
##     data = regData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42296  -2126  -1122     -1 628950 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.275e+03  6.968e+02   1.829 0.067395 .  
## self_reference_min_shares  6.145e-02  4.788e-03  12.835  < 2e-16 ***
## kw_avg_avg                 8.891e-01  1.420e-01   6.260 4.01e-10 ***
## kw_max_avg                -1.003e-01  2.803e-02  -3.578 0.000348 ***
## LDA_03                     2.156e+02  4.016e+02   0.537 0.591457    
## global_subjectivity        4.224e+03  1.123e+03   3.762 0.000169 ***
## average_token_length      -1.002e+03  1.944e+02  -5.156 2.57e-07 ***
## num_hrefs                  2.767e+01  1.012e+01   2.734 0.006274 ** 
## n_unique_tokens            5.379e+03  2.186e+03   2.460 0.013895 *  
## abs_title_subjectivity     1.491e+03  5.258e+02   2.835 0.004585 ** 
## title_sentiment_polarity   9.269e+02  3.729e+02   2.486 0.012942 *  
## avg_negative_polarity     -1.445e+03  8.630e+02  -1.675 0.093970 .  
## n_non_stop_unique_tokens  -3.061e+03  2.055e+03  -1.490 0.136316    
## is_weekend                 4.198e+02  2.834e+02   1.481 0.138581    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9548 on 9986 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03641 
## F-statistic: 30.07 on 13 and 9986 DF,  p-value: < 2.2e-16
detach("package:MASS")

Since the model has a very low R-squared / adjusted R-squared, we will visualize the standardized residuals:

qplot(predict(step), rstandard(step), geom="point", xlim = c(0,10000)) + geom_hline(yintercept=0, colour=I("blue"), alpha=I(0.5))
## Warning: Removed 50 rows containing missing values (geom_point).

#We remove the variables with the highest p-values and rerun the regression
mod = lm(formula = shares ~ self_reference_min_shares + kw_avg_avg + 
    kw_max_avg + global_subjectivity + average_token_length + 
    num_hrefs + n_unique_tokens + abs_title_subjectivity + title_sentiment_polarity,
    data = regData)
summary(mod)
## 
## Call:
## lm(formula = shares ~ self_reference_min_shares + kw_avg_avg + 
##     kw_max_avg + global_subjectivity + average_token_length + 
##     num_hrefs + n_unique_tokens + abs_title_subjectivity + title_sentiment_polarity, 
##     data = regData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42304  -2146  -1148      4 629298 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.058e+03  6.757e+02   1.566 0.117345    
## self_reference_min_shares  6.142e-02  4.788e-03  12.827  < 2e-16 ***
## kw_avg_avg                 9.837e-01  1.298e-01   7.577 3.84e-14 ***
## kw_max_avg                -1.128e-01  2.696e-02  -4.184 2.88e-05 ***
## global_subjectivity        4.654e+03  1.048e+03   4.440 9.10e-06 ***
## average_token_length      -1.080e+03  1.833e+02  -5.889 4.02e-09 ***
## num_hrefs                  3.258e+01  9.897e+00   3.292 0.000998 ***
## n_unique_tokens            2.638e+03  1.013e+03   2.604 0.009223 ** 
## abs_title_subjectivity     1.431e+03  5.253e+02   2.724 0.006469 ** 
## title_sentiment_polarity   8.451e+02  3.693e+02   2.288 0.022134 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9550 on 9990 degrees of freedom
## Multiple R-squared:  0.03687,    Adjusted R-squared:  0.036 
## F-statistic: 42.49 on 9 and 9990 DF,  p-value: < 2.2e-16
#Again, we remove the variables with high p-values
summary(lm(formula = shares ~ self_reference_min_shares + kw_avg_avg + 
    kw_max_avg + global_subjectivity + average_token_length + 
    num_hrefs + n_unique_tokens + abs_title_subjectivity, data = regData))
## 
## Call:
## lm(formula = shares ~ self_reference_min_shares + kw_avg_avg + 
##     kw_max_avg + global_subjectivity + average_token_length + 
##     num_hrefs + n_unique_tokens + abs_title_subjectivity, data = regData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42055  -2153  -1158      0 629293 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.230e+03  6.716e+02   1.831 0.067183 .  
## self_reference_min_shares  6.135e-02  4.789e-03  12.811  < 2e-16 ***
## kw_avg_avg                 9.885e-01  1.298e-01   7.614 2.91e-14 ***
## kw_max_avg                -1.129e-01  2.696e-02  -4.189 2.83e-05 ***
## global_subjectivity        4.791e+03  1.047e+03   4.577 4.78e-06 ***
## average_token_length      -1.091e+03  1.833e+02  -5.954 2.71e-09 ***
## num_hrefs                  3.317e+01  9.896e+00   3.352 0.000804 ***
## n_unique_tokens            2.557e+03  1.013e+03   2.525 0.011592 *  
## abs_title_subjectivity     1.146e+03  5.105e+02   2.246 0.024753 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9552 on 9991 degrees of freedom
## Multiple R-squared:  0.03637,    Adjusted R-squared:  0.03559 
## F-statistic: 47.13 on 8 and 9991 DF,  p-value: < 2.2e-16

Our final model is the following:

model = lm(formula = shares ~ self_reference_min_shares + kw_avg_avg + 
    kw_max_avg + global_subjectivity + average_token_length + 
    num_hrefs + n_unique_tokens, data = regData)
summary(model)
## 
## Call:
## lm(formula = shares ~ self_reference_min_shares + kw_avg_avg + 
##     kw_max_avg + global_subjectivity + average_token_length + 
##     num_hrefs + n_unique_tokens, data = regData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42338  -2137  -1170    -25 629505 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.593e+03  6.519e+02   2.444 0.014538 *  
## self_reference_min_shares  6.132e-02  4.790e-03  12.801  < 2e-16 ***
## kw_avg_avg                 9.894e-01  1.299e-01   7.619 2.79e-14 ***
## kw_max_avg                -1.137e-01  2.697e-02  -4.214 2.53e-05 ***
## global_subjectivity        4.732e+03  1.047e+03   4.521 6.22e-06 ***
## average_token_length      -1.076e+03  1.832e+02  -5.872 4.45e-09 ***
## num_hrefs                  3.320e+01  9.898e+00   3.355 0.000797 ***
## n_unique_tokens            2.531e+03  1.013e+03   2.499 0.012454 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9554 on 9992 degrees of freedom
## Multiple R-squared:  0.03588,    Adjusted R-squared:  0.0352 
## F-statistic: 53.12 on 7 and 9992 DF,  p-value: < 2.2e-16

Model Accuracy

Since we had a low R-squared, we decided to calculate the accuracy of our model to see how well it predicts the number of shares.

In order to measure the accuracy of the model, and following the article “Predicting and Evaluating the Popularity of Online News”, we decided to use the Accuracy indicator found in Table IV of that article.

We will split the variable “shares” into 2 groups of equal size and generate a prediction that will also be split into 2 groups of equal size. The logic of the indicator is to measure how many of the predictions are correct (the sum of true positives and true negatives compared to the size of the whole sample).

#Here, we switch to the whole dataset instead of the sample dataset.
summary(data$shares)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     946    1400    3395    2800  843300
#We add a column aboveMedian that classifies each article by whether its number of shares is at or above the median of the whole dataset.
data = data %>%
  mutate(aboveMedian = ifelse(shares>= 1400, 1, -1))

fitted.results = predict(model,data,type='response')
#expected median
med = median(fitted.results)
fitted.results = ifelse(fitted.results >= med,1,-1)

misClassificationError = mean(fitted.results != data$aboveMedian)
print(paste('Accuracy', 1 - misClassificationError))
## [1] "Accuracy 0.591665825850066"
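The indicator above can be decomposed explicitly into true positives and true negatives. Since the report's code is in R, here is a minimal Python sketch of the same median-split logic on made-up share counts (the numbers below are purely illustrative, not from the dataset):

```python
import numpy as np

# Hypothetical actual and predicted share counts (illustrative only).
actual_shares = np.array([100, 500, 1200, 1400, 2000, 3000, 5000, 9000])
predicted_shares = np.array([300, 900, 1600, 800, 2500, 1100, 4000, 7000])

# Label each article +1/-1 depending on whether it falls at or above the median.
actual_label = np.where(actual_shares >= np.median(actual_shares), 1, -1)
pred_label = np.where(predicted_shares >= np.median(predicted_shares), 1, -1)

# Accuracy = (true positives + true negatives) / sample size.
tp = np.sum((pred_label == 1) & (actual_label == 1))
tn = np.sum((pred_label == -1) & (actual_label == -1))
accuracy = (tp + tn) / len(actual_label)
```

With these toy values, 3 articles are correctly predicted above the median and 3 below it, giving an accuracy of 6/8 = 0.75; the R code above applies the same computation to all 39,644 articles.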

Check Model Assumptions

A1: Checking Linear Relation

qplot(predict(model), rstandard(model), geom="point", xlim= c(0,10000), ylim = c(-1,20)) + geom_hline(yintercept=0, colour=I("blue"), alpha=I(0.5))
## Warning: Removed 48 rows containing missing values (geom_point).

The residuals are not distributed symmetrically around the horizontal zero line, suggesting there is no clear linear relation between our predictors and the number of shares. There is a very evident tendency for the errors to be spread out more above the x-axis than below it.

This is related to the fact that the number of shares is not evenly distributed: there are very few cases in which the number of shares is extremely high.

A2: Checking Normality

We check normality by plotting a QQ-plot of the residuals as well as a histogram of the residuals.

# Check normality using histogram
q1 = qplot(rstandard(model), geom="blank", xlim = c(-1,10)) +
  geom_histogram(aes(y=..density..), colour=I("gray"), binwidth=0.2)+
  stat_function(fun=dnorm, args=list(mean=0, sd=1),
                colour=I("red"), alpha=I(0.5))
# Check normality using qqplot
q2 = qplot(sample=rstandard(model)) +
  geom_abline(slope=1,intercept=0)

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(q1, q2, nrow=1)
## Warning: Removed 28 rows containing non-finite values (stat_bin).

The residuals don’t seem to be normally distributed. Their distribution is asymmetrical and, even though its expected value is 0, there is a clear case of skewness.

A3: Checking Homoscedasticity

qplot(predict(model), rstandard(model), geom="point", ylim = c(-10,20)) + geom_hline(yintercept=0) +
  geom_hline(yintercept=2, colour = I("red"), alpha=I(0.5)) +
  geom_hline(yintercept=-2, colour = I("red"), alpha=I(0.5))
## Warning: Removed 2 rows containing missing values (geom_point).

The homoscedasticity assumption is violated, as the chart shows a significant number of points outside the ±2 standard-deviation band.

Therefore, we will try to find the optimal transformation of the response variable (shares) in order to obtain a homoscedastic model.

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
spreadLevelPlot(model) 
## Warning in spreadLevelPlot.lm(model): 7 negative fitted values removed

## 
## Suggested power transformation:  -0.3020725
# the suggested transformation is -0.3020725 
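For reference, the spread-level plot regresses the log of the absolute studentized residuals on the log of the fitted values and, following Tukey's rule, suggests the power p = 1 - b, where b is the fitted slope. A minimal Python sketch of that rule on synthetic residuals whose spread grows with the fitted level (all values here are simulated, not from the dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
fitted = rng.uniform(100, 10000, size=5000)
# Residuals whose spread grows roughly as fitted^0.8 (simulated heteroscedasticity).
resid = rng.normal(0, 1, size=5000) * fitted**0.8

# Spread-level rule: slope b of log|residual| vs log(fitted);
# Tukey's suggested power transformation is p = 1 - b.
b, _ = np.polyfit(np.log(fitted), np.log(np.abs(resid)), 1)
suggested_power = 1 - b  # should land near 1 - 0.8 = 0.2
```

The same logic applied to our model yields the suggested power of roughly -0.30, i.e. shares^-0.302.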

We will use the transformed variable and see if the model improves.

regData = regData %>%
  mutate(sharesPower = shares^-0.3020725)

model2 = lm(data = regData, formula = sharesPower ~ self_reference_min_shares + kw_avg_avg + 
    kw_max_avg + global_subjectivity + average_token_length + 
    num_hrefs + n_unique_tokens)

summary(model2)
## 
## Call:
## lm(formula = sharesPower ~ self_reference_min_shares + kw_avg_avg + 
##     kw_max_avg + global_subjectivity + average_token_length + 
##     num_hrefs + n_unique_tokens, data = regData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.08901 -0.01641  0.00217  0.01615  0.50195 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.234e-01  1.861e-03  66.312  < 2e-16 ***
## self_reference_min_shares -8.805e-08  1.367e-08  -6.441 1.24e-10 ***
## kw_avg_avg                -6.731e-06  3.706e-07 -18.162  < 2e-16 ***
## kw_max_avg                 9.039e-07  7.696e-08  11.744  < 2e-16 ***
## global_subjectivity       -1.825e-02  2.987e-03  -6.110 1.03e-09 ***
## average_token_length       1.409e-03  5.228e-04   2.695 0.007045 ** 
## num_hrefs                 -1.884e-04  2.825e-05  -6.670 2.69e-11 ***
## n_unique_tokens            1.026e-02  2.890e-03   3.551 0.000385 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02727 on 9992 degrees of freedom
## Multiple R-squared:  0.06759,    Adjusted R-squared:  0.06694 
## F-statistic: 103.5 on 7 and 9992 DF,  p-value: < 2.2e-16

We have a slightly higher R-squared. We will continue using this model in the next stages.

qplot(predict(model2), rstandard(model2), geom="point", ylim = c(-10,20)) + geom_hline(yintercept=0) +
  geom_hline(yintercept=2, colour = I("red"), alpha=I(0.5)) +
  geom_hline(yintercept=-2, colour = I("red"), alpha=I(0.5))

We can see that with the new model, the homoscedasticity assumption is respected.

A4: Checking Independence

The residuals might have been auto-correlated in the first years of Mashable (founded in 2005), since a very successful article could have increased the popularity of the website and thereby influenced the popularity of future articles.

Hence, we want to check whether the website is now popular enough that we can rule out such a time-series effect, in which the website's growing popularity depends on the number of shares of its articles.

We continue using model 2 to check the rest of the assumptions, and run the Durbin-Watson test to check the independence of the residuals.

library(car)
durbinWatsonTest(model2)
##  lag Autocorrelation D-W Statistic p-value
##    1     0.008251294      1.983457   0.402
##  Alternative hypothesis: rho != 0

The p-value is high, so we fail to reject the null hypothesis of no autocorrelation: there is no evidence that the residuals are auto-correlated.
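The Durbin-Watson statistic itself is simple to compute by hand: it is the sum of squared successive residual differences divided by the sum of squared residuals, approximately 2(1 - rho) for first-order autocorrelation rho, so values near 2 (like our 1.98) indicate independence. A minimal Python sketch on simulated independent residuals:

```python
import numpy as np

rng = np.random.default_rng(1)
resid = rng.normal(size=10000)  # simulated independent residuals

# DW = sum((e_t - e_{t-1})^2) / sum(e_t^2), approximately 2 * (1 - rho).
dw = np.sum(np.diff(resid)**2) / np.sum(resid**2)  # close to 2 when rho = 0
```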

Model Validation

We performed a final validation of the model, comparing its performance on a training set and a testing set.

We first split the dataset:

library(caTools)
set.seed(17)

data = data %>%
  mutate(sharesPower = shares^-0.3020725)

split = sample.split(data$sharesPower, SplitRatio = 0.8) 
training = subset(data, split==TRUE)
testing = subset(data, split==FALSE)

We then re-fit the model on the new training set.

fit <- lm(data= training, sharesPower ~ self_reference_min_shares + kw_avg_avg + 
    kw_max_avg + global_subjectivity + average_token_length + 
    num_hrefs + n_unique_tokens)

summary(fit)
## 
## Call:
## lm(formula = sharesPower ~ self_reference_min_shares + kw_avg_avg + 
##     kw_max_avg + global_subjectivity + average_token_length + 
##     num_hrefs + n_unique_tokens, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.09283 -0.01671  0.00202  0.01660  0.90756 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.263e-01  1.069e-03 118.143  < 2e-16 ***
## self_reference_min_shares -5.894e-08  7.601e-09  -7.755 9.12e-15 ***
## kw_avg_avg                -7.348e-06  2.134e-07 -34.432  < 2e-16 ***
## kw_max_avg                 8.855e-07  4.386e-08  20.188  < 2e-16 ***
## global_subjectivity       -2.003e-02  1.721e-03 -11.638  < 2e-16 ***
## average_token_length       2.512e-03  2.441e-04  10.289  < 2e-16 ***
## num_hrefs                 -1.903e-04  1.430e-05 -13.305  < 2e-16 ***
## n_unique_tokens           -5.937e-05  3.984e-05  -1.490    0.136    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02791 on 31809 degrees of freedom
## Multiple R-squared:  0.0712, Adjusted R-squared:  0.07099 
## F-statistic: 348.3 on 7 and 31809 DF,  p-value: < 2.2e-16
#We remove n_unique_tokens from the model because it has a high p-value.

fit <- lm(data= training, sharesPower ~ self_reference_min_shares + kw_avg_avg + 
    kw_max_avg + global_subjectivity + average_token_length + 
    num_hrefs)

summary(fit)
## 
## Call:
## lm(formula = sharesPower ~ self_reference_min_shares + kw_avg_avg + 
##     kw_max_avg + global_subjectivity + average_token_length + 
##     num_hrefs, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.09282 -0.01671  0.00203  0.01659  0.90756 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.263e-01  1.069e-03 118.144  < 2e-16 ***
## self_reference_min_shares -5.894e-08  7.601e-09  -7.755  9.1e-15 ***
## kw_avg_avg                -7.351e-06  2.134e-07 -34.448  < 2e-16 ***
## kw_max_avg                 8.859e-07  4.386e-08  20.199  < 2e-16 ***
## global_subjectivity       -1.997e-02  1.720e-03 -11.606  < 2e-16 ***
## average_token_length       2.499e-03  2.440e-04  10.243  < 2e-16 ***
## num_hrefs                 -1.901e-04  1.430e-05 -13.292  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02791 on 31810 degrees of freedom
## Multiple R-squared:  0.07113,    Adjusted R-squared:  0.07096 
## F-statistic:   406 on 6 and 31810 DF,  p-value: < 2.2e-16
Rsq.Training = summary(fit)$r.squared 

Finally, we evaluate the R-squared on the testing dataset:

SSE = (predict(fit, newdata = testing) - testing$sharesPower)^2 %>% sum()
SSTotal = (testing$sharesPower - mean(testing$sharesPower))^2 %>% sum()
Rsq.Testing = (SSTotal - SSE)/SSTotal
cat("R^2 Training = ", Rsq.Training, " vs R^2 Testing ", Rsq.Testing, ".", sep="")
## R^2 Training = 0.07113062 vs R^2 Testing 0.04879411.
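Note that the testing R-squared computed above, (SSTotal - SSE)/SSTotal = 1 - SSE/SSTotal, is not bounded below by zero: out of sample it turns negative whenever the model predicts worse than the test-set mean. A minimal Python sketch of the same formula, with made-up numbers:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Out-of-sample R-squared: 1 - SSE/SST, i.e. (SST - SSE)/SST."""
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - sse / sst

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y, y)                            # 1.0: zero error
poor = r_squared(y, np.array([2.0, 2.0, 2.0, 2.0]))  # negative: worse than the mean
```

Our testing R-squared of 0.049 is positive but well below the training value of 0.071, a mild sign of overfitting.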

Final Considerations

Even though we found a final model whose variables have very low p-values (some lower than 10^-15), the outcome is a model with an R-squared barely above zero.

The response variable has a very particular, heavily skewed distribution that generated various challenges during the modelling process.

We have some ideas about additional actions that could improve the results.

Changes in the response variable

The field shares could be transformed into 3 categories: Low_Number_Shares (0), High_Number_Shares (1), and an indeterminate zone in between.

The indeterminate zone could be excluded from the database (as it is a grey area and might distort the modelling process), whereas the distinction between Low_Number_Shares (0) and High_Number_Shares (1) might help to distinguish more clearly the characteristics that influence the success of an article (measured in number of shares). The way the split is chosen has to follow a logical process and take the distribution of the data into account.

As we would be working with a binary response variable, a logistic (logit) regression model would be appropriate.
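The proposed approach could be sketched as follows. This is a hypothetical Python illustration on simulated data: the 40%/60% quantile cut-offs, the single feature, and the plain gradient-descent logistic fit (standing in for R's glm(..., family = binomial)) are all assumptions for the sake of the sketch, not choices validated on the dataset:

```python
import numpy as np

rng = np.random.default_rng(2)
shares = rng.lognormal(mean=7, sigma=1, size=2000)  # simulated skewed share counts
# One noisy feature correlated with log-shares (purely illustrative).
feature = np.log(shares) + rng.normal(0, 0.5, size=2000)

# Three-way split with hypothetical 40%/60% quantile cut-offs: low / grey zone / high.
lo, hi = np.quantile(shares, [0.4, 0.6])
label = np.where(shares <= lo, 0, np.where(shares >= hi, 1, -1))

# Exclude the grey zone so the response is binary.
keep = label != -1
x = (feature[keep] - feature[keep].mean()) / feature[keep].std()
X = np.column_stack([np.ones(len(x)), x])  # intercept + standardized feature
y = label[keep].astype(float)

# Plain gradient-descent logistic regression fit.
w = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.5 * X.T @ (p - y) / len(y)

p = 1 / (1 + np.exp(-X @ w))
accuracy = np.mean((p >= 0.5) == (y == 1))
```

Dropping the grey zone makes the two classes better separated, so a classifier of this kind could plausibly beat the 0.59 accuracy of the median-split regression approach.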

Changes in the independent variables

It might be necessary to generate additional variables or transformations to increase the predictive power of the model. We did not perform these tasks in this exercise because of the large number of independent variables to analyze (almost 60).